1 

Convergence and Applications of a Gossip-based 

Gauss-Newton Algorithm 

Xiao Li, Student Member, IEEE, and Anna Scaglione, Fellow, IEEE 



Abstract — The Gauss-Newton algorithm is a popular and 
efficient centralized algorithm for solving non-linear least squares 
problems. In a large network, however, distributed observations 
are usually aggregated at a fusion center in order to apply the 
algorithm centrally, which creates inevitable communication and 
storage bottlenecks. In this paper, we study a distributed version 
of Gauss-Newton algorithm via gossiping, and show the con- 
vergence of this Gossip-based Gauss-Newton (GGN) algorithm. 
As an example, we show numerically that the proposed GGN 
algorithm is effective and robust in solving power system state 
estimation problems, and that the Mean Square Error (MSE) 
performance remains comparable to the centralized scheme and 
degrades gracefully even when the network exhibits random 
link/node failures. 

Index Terms — Gauss-Newton algorithm, gossiping, distributed, 
convergence 

I. Introduction 

Numerical algorithms for solving non-linear least squares 
(NLLS) problems are well studied and understood |1|. The 
most popular solution to NLLS problems is through the so 
called Newton and Gauss-Newton method. Newton algorithms 
are second order methods that use the Hessian of the objective 
function to stabilize and accelerate local convergence 0,0, 
while Gauss-Newton simplifies the computation of the Hessian 
particularly for NLLS problems by ignoring the high order 
derivatives |4|. The Gauss-Newton algorithm is often used in 
NLLS problems due to its convenience and simplicity, with 
applications including (but not limited to) power systems |5l, 
localization |6|, frequency estimation |7|, Kalman filtering |8|, 
medical imaging | 9 | and so on. 

In a large network, the most common sensing architecture 
includes a central node that aggregates all the acquired data 
from the distributed agents and implement these algorithms 
centrally. Unfortunately, due to the scale of a networked 
system, centralized processing schemes do not scale well, 
because of the information bottleneck as well as the overhead 
that is required to support reliable aggregation of data. Also, 
these methods typically rely upon aggregation trees and are, 
therefore, susceptible to single-point failures, denial-of- service 
attacks due to network congestions and to random link or node 
failures. All the features that are lacking in centralized schemes 
can be found in gossip algorithms. Since their introduction 
|T0| , they have been extensively investigated |TT|, |T2|, as 
surveyed in f73| . Deterministic and randomized protocols for 
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gossip algorithms with synchronous or asynchronous updates 
have been further studied [14 J , |15 | and applied in different 
areas in networked control and distributed signal processing, 
such as distributed Kalman filtering fT6l for estimation and 
tracking in a network, or convex optimization problems using 
gossip-based sub-gradient updates |T7| . 

Particularly relevant to gossip algorithms are recent ad- 
vances made in distributed optimization via network diffusion, 
which evolved from incremental methods 1 18], 1 19] and gossip 
algorithms p7| onto fully decentralized and randomized algo- 
rithms. The distributed algorithms analyzed in f20l-f24l tackle 
convex optimization problems through either synchronous or 
asynchronous communications. These techniques combine a 
local descent step with a network diffusion step. The conver- 
gence of these diffusion algorithms typically requires convexity 
and a diminishing step-size, which results in slow conver- 
gence in general |25|. Recently, |26| proposed a diffusion 
optimization scheme for general non-linear convex problems 
with a constant step-size by assuming local strong convexity. 
Furthermore, convergence analysis of network diffusion algo- 
rithms has also been developed for adaptive signal processing 
on streaming observations using a constant step-size |[27|- 
f29l for linear filtering problems, or using a diminishing 
step-size [24J for non-linear invertible systems. Despite the 
simplicity of first order methods used in diffusion algorithms, 
they generally suffer from slow convergence in contrast to 
Newton-type algorithms. 

To enhance the speed of first order methods, convergence 
results of a gossip-based Newton method are derived in |3Q| 
for network utility maximization problems. The algorithm 
relies on the diagonal structure of the Hessian matrix, and 
this approach is later applied to power flow estimation pi] . 
We note, however, that the convergence of the distributed 
Newton method is proven under they hypothesis that the error 
of the computed Newton descent is bounded, which is assumed 
to be true but not verified. Last but not least, the method 
is developed specifically for strictly convex problems, where 
the variables are completely separable for each distributed 
agent, while many NLLS problems are oftentimes non-convex 
and inseparable, and thus it is unclear how these methods 
perform in practice if applied directly. There are also some 
ad-hoc applications of Gauss-Newton methods via network 
average consensus in sensor networks p2|-p4| or incremental 
methods in acoustic sources localization | |35| and yet these 
works use gossiping to compute the algorithm update for the 
problems at hand, lacking the analysis of their convergence 
and performance. 
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A. Contributions and Applications 



The contributions of this paper are stated as follows. Firstly, 
we study a Gossip-based Gauss-Newton (GGN) algorithm for 
general NLLS problems, which exhibits much faster conver- 
gence than first order network diffusion algorithms. This algo- 
rithm only requires flexible near-neighbor communications and 
the communication graph can be time-varying. Secondly, we 
present an analysis on the local convergence and performance 
of the GGN algorithm, which can serve as application guide- 
lines in a general setting. This is especially useful for problems 
that arise in |32|-|34|, which uses a similar approach in an 
ad-hoc manner without discussing related numerical issues. 
Finally, we showcase the application of the GGN algorithm 
in power system state estimation | [36| , jSTl, which is a classic 
NLLS problem in power systems that has gained considerable 
interests and popularity lately to realize wide-area distributed 
control. We note that there has been tremendous effort spent 
on developing distributed state estimation schemes to alleviate 
the computation burden of the centralized processing p8|- 
pTj . Most of these algorithms distribute the computations 
by hierarchically aggregating or fusing the information and 
state estimates from distributed control areas, which assumes 
that there are redundant local measurements available at each 
area to uniquely identify the local state variables (i.e., local 
observability), which we do not assume in this paper. 

Recently, there are methods proposed for distributed state 
estimation that do not require hierarchical aggregation nor 
local observability (48} , (491. In comparison, the proposed 
GGN algorithm is very different in terms of the network 
communications and algorithm convergence. The method in 
(481 is based on the diffusion algorithm motivated by |24| | 
(similar to |20| in a non-adaptive setting), which is a first 
order method requiring coordination among all the agents and 
resulting in slow convergence. Our approach converges much 
faster, and the communication model we imposed, which can 
be either coordinated or uncoordinated, is much more flexible 
and robust. On the other hand, |49 1 follows a similar approach 
in |42|-|44| and uses the alternating direction method of mul- 
tipliers (AD-MoM) to distribute the state estimation procedure 
by decomposing the state variables in different areas so that 
each agent estimates a local state, in contrast to the global 
state considered in this paper. Furthermore, the communica- 
tions entailed by AD-MoM is constrained by the power grid 
topology, while the communication model considered in this 
paper is decoupled from the grid topology and more flexible in 
terms of network reconfigurations and random failures. Also, 
the numerical tests in (49j are based exclusively on a linear 
model using PMU data, while the algorithm convergence in 
general is not discussed. In contrast, the convergence of the 
proposed GGN algorithm in this paper is thoroughly analyzed. 
In the simulations, we show the performance of our GGN 
algorithm against the algorithm tailored for power system 
state estimation in |48| and its generalized version (24| in 
an adaptive setting with real-time streaming measurements. 



B. Notation and Paper Organization 

We denote vectors and matrices by boldface lower-case and 
boldface upper-case symbols and the set of real (complex) 
numbers by R (C). The magnitude of a complex number x 
is denoted by | where x* is the conjugate of 

the complex number x. The transpose, conjugate transpose, 
and inverse of a matrix X are denoted by X^, X^ and 
X~^, respectively. The inner products between two vectors 
X, y G C^^^ is defined accordingly as (x, y) = Y^^^i Vn^n- 
The W-weighted Euclidean norm of a vector x is denoted by 
||x||^ = Vx^Wx, and the conventional Euclidean norm is 
written as ||x||. The Euclidean norm of a matrix A is denoted 
by II A II and the Frobenius norm is denoted by ||A||i7. Given 
a matrix A = [ai, • • • , bat] where is a column vector, the 
vectorization operator is defined as vec(A) = [af , • • • , a^]^. 

The paper is organized as follows. In Section |ll| we define 
the NLLS problems and provide the corresponding distributed 
NLLS formulation in a network. Then, the proposed GGN 
algorithm is described in detail in Section [lll| with thorough 
convergence analysis conducted in Section |IV| Furthermore, 
we formulate power system state estimation in Section [V] as 
a NLLS problem and solve it using the proposed GGN algo- 
rithm. Finally, the convergence and performance of the GGN 
algorithm for power system state estimation is demonstrated in 



Section [Vl| with comparisons to other decentralized estimation 
schemes. 

II. Problem Statement 

A non-linear least squares (NLLS) problem is typically 
formulated as 



mm 



(1) 



where x G X is the underlying A^-dimensional state vector 
belonging to a closed convex feasible set X, and g(ic) = 
[gi{x)^--- ,^m(^c)]^ is a vector- valued function with M 
outputs determined by gm{') '• ^ M, m = 1, • • • , M. 



A. Centralized Gauss-Newton Algorithm 

If all the agents aggregate their data and the functions to 
a central point, then the Gauss-Newton method is usually 
employed to solve NLLS problems |4| in an iterative manner 
as follows 



(2) 



with some initialization x^, where is the step-size in the 
k-th iteration and Px[-] is a projection onto the constrained set 
X. The quantity is the Gauss-Newton descent 

= [G^{x^)G{x^)] G^{x^)g{x^), (3) 

where G{x) is the M x N Jacobian matrix defined as 
G{x) = dg{x)/dx^. In general, the fixed point of the 
algorithm update ^ is not unique, each corresponding to one 
of the stationary points of the cost function satisfying the first 
order condition 



G^(ic*)g(ic*) = 0, x"" e X. 



(4) 
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where the exact decentraUzed descent is given by 





network sensors 

network sensors 

network sensors 
Fig. 1. Schematic of multi-agent computation structure. 



Note that if is chosen differently at each iteration, 
the algorithm is called the damped Gauss-Newton method 
while ak = corresponds to the undamped Gauss -Newton 
method. If the Gauss-Newton Hessian matrix {x^)G{x^) 
is positive semi-definite, the resultant constitutes a descent 
direction of the objective function. It is well-known that if the 
step-size ak chosen according to the Wolfe condition ||T|, the 
Gauss-Newton iteration converges to a stationary point of the 
cost function. Since many NLLS problems are non-convex by 
nature, the focus in this paper is to study the local convergence 
property of the algorithm to an arbitrary fixed point cc"^ G X. 



B. Distributed Formulation 

Although the centralized Gauss-Newton algorithm is well 
understood, it is not immediately clear how this iteration can 
be implemented in a distributed manner such that similar local 
convergence properties can be maintained. As shown in Fig. 
[T] suppose there are / distributed agents, and the z-th agent 



only knows a subset function gi(cc) 
l{x) = : 



pN 



pMi 



from ([T]) 



(5) 



with M = X)i=i ^i- III this setting, the goal is to minimize 

/ 

min J2\\gi{x)f, (6) 

where each agent has only partial knowledge of this global cost 
function. In this case, it is also not clear how to coordinate 
the step-size at different agents such that the Wolfe condition 
|[T| would satisfy in a global sense. A variable step- size is also 
quite inconvenient in a decentralized setting, because of the 
difficulties of coordinating a change in the step- size across a 
network. As a result, we study the convergence behavior of 
the undamped Gauss-Newton case with a constant step- size 
ak = a < 1 for an arbitrary agent i 



dt = lG^(a.f )G(x 



G^(xf)g(a.f). 



(8) 



It is known from ^ that at each iteration, each agent requires 
the computation of 



while the z-th agent has only partial information available to 
compute Gj {x^)Gi{x^) and Gj{x^)^i{x^) locally. Note 
that these quantities are written as sums of individual network 
components, therefore, the distributed computations of the 
Gauss-Newton algorithm can emulate the centralized version 
by exchanging information between different agents, similar 
to that in |32|. Therefore, in next section, we introduce the 
Gossip-based Gauss-Newton (GGN) algorithm in a network 
for solving NLLS problems. 

III. Distributed Gauss-Newton Algorithm via 
Network Gossiping 

The proposed GGN algorithm emulates the computation of 
in ([3]) in a fully distributed manner. There are two time 
scales in the GGN algorithm, one is the time for Gauss-Newton 
update and the other is the gossip exchange between every 
two Gauss-Newton updates. Throughout the rest of the paper, 
we consistently use update for the Gauss-Newton algorithm, 
denoted by the discrete time index "/c" and exchange for 
network gossiping, denoted by another discrete time index 
We assume that all the network agents have a common 
clock that runs synchronously and determines the time instants 
t = Tk for the k-\h algorithm update across the network. 
Between two successive updates t G [r/e,r/e+i), the agents 
communicate and exchange information with each other in 
the form of network gossiping at time Tk^i G [r/c,r/c+i) for 
£ = I,-- - after each update k. The flow chart of the 
algorithm is illustrated in Fig. [2j where the algorithm update 
and gossip exchange alternate in time. 

In this section, we walk through the local update model for 
the /c-th algorithm update at each distributed agent in Section 



(7) 



III-A and introduce in Section III-B the gossip model for 
every exchange £ = 1, • • • that takes place between the 
k-th and (/c + l)-th updates, where the / agents advance their 
computations via network gossiping. 

A. Local Update Model 

Let xf be the local iterate at the i-th agent after the k-th 
update. For convenience, let 

r(x,^) = i^Gj(xf)g,(a.f), (9) 
R{x'^) = ^j2^l{xf)GM)- (10) 



^ Exchange ^^^^^ Exchange 

k = 1 k = 2 




k = K-l 



k = K 



Fig. 2. GGN algorithm update and gossip exchange flow chart. 



The "exact decentralized descent" in ([8]), if it were to be 
computed at the i-th agent for the {k -\- l)-th update, can be 
equivalently obtained as 



(11) 



which is impossible to obtain in a distributed setting. This is on 
one hand because the accessible functions gp(-) and Jacobians 
Gp(-) to agent p does not have the local estimate at agent 
i ^ p to evaluate their functions for such computations, and 
on the other hand, the i-th agent does not have the accessible 
functions gp(-) and Jacobians Gp(-) even if they can be 
evaluated by other agents. In fact, the available information 
at the i-th agent after the k-th Gauss-Newton update is 
Gj {x^)gi{xf) and Gj {xf)Gi{x^). Therefore alternatively, 
we propose to use an average surrogate for r{x^) and ^{x^) 

hk = ]i2^nx'i)giixt), (12) 

i=l 

1 I 

tlk = jY.^f{x'l)Gi{x'l), (13) 



which can be obtained via network gossiping. Intuitively 
speaking, if the distributed estimates stay close x^ ^ Xj, then 
h/e ^ r{x^) and Hfc ~ '^(x^) for all i. 

Therefore, after the k-th update by each agent at r^, the net- 
work enters gossip exchange stage [r/e,r/e+i) to compute the 
surrogate hk and Hfc. Define the length- A^^ local information 
vector (i.e., = N{N + 1)) at the i-th agent for the ^-th 
gossip exchange at Tk/ 



vec [UkA^)] 



(14) 



evolving from the initial information 4>k,i{^) at each agent 
with h/e,i(0) and H/e,i(0) given below 



hfc,z(0) ^ 
H,,,(0) ^ 



Gj{x^)g,{x^), 
Gj{x^)G,{x^). 



(15) 
(16) 



With this definition, clearly we have hk = h/e,i(0)// 
and H/e = H/e,i(0)//. Then, all network agents ex- 

change their information h.k^i{i) hk^i{i-\-l) and Hk,i{^) 
Hk,i{^ + 1) using the gossiping protocol described later in 
Section [IirHl 

After £k exchanges, the "inexact descent" for the + l)-th 
update at the i-th agent is 

d,^(4) = H-](4)hfc,^(4) (17) 

and the local estimate is updated as 

x^^'=Px\x^-ad^{h)]- (18) 



B. Network Gossiping Model 

Before we start, we begin by describing the communication 
model for network gossiping that supports the information 
flows and exchange among the agents. We borrow the insights 
from |T0| , pO| , pOl and impose some rules on the commu- 
nications among all the agents over time. Let us consider a 
time- varying random communication graph Qk/ = (^, A^fe,^) 
during [r/e,^, r/e^^+i) for every k and £, with the node set 
X = {1, • • • ,/} (i.e., agent) and the edge set {i^j} G Mk,£- 
The edge set Mk/ is characterized by the adjacency matrix 



Afe(^) = [A^%xi. where ^^'^^ = 1 if {ij} e Mk,i and 



ik/) _ 



if otherwise. For each agent i in the network, there is an 
associated neighbor set M^^\ = {j : {i^j} G Mk/} with 
which each agent exchanges information locally. 



Assumption 1. The composite graph (X, is 
connected for all £ > and there exists an integer L > 1 
such that for every agent pair {i^j} in the composite graph, 
we have for any £ > 



{ij} e Mk/[jMk/+i[j- ■ - \jMk,i+L-i- 



(19) 



The above assumption states that the composite graph 
consists of all agent pairs {i, j} that communicate directly 
infinitely many times and that this composite graph is con- 
nected. Furthermore, each agent communicates with another 
agent in the composite graph within a bounded communication 
interval of L, such that there exists an active link between any 
agent pair {i^j} G UPL^-^fe,^' at least every L consecutive 
time slots [rk/,rk/^L-i] ^ (r/c,r/e+i) for any £. 

With the communication model described in Assumption [T] 
each agent combines the information sent from its neighbors 
with certain weights. Define a weight matrix Wk{£) = 
[Wtj{^)]ixi for the network topology during [r/e,^, r/e,^+i), 
where the (i, j)-th entry W^j{£) of the matrix Wk{£) is the 
weight associated to the edge {i, j}, which is non-zero if and 
only if {i, j} G Mk,£- Therefore, the matrix Wk{£) has the 
same sparsity pattern as the communication network graph 
Ak{£), and it is determined by the network's connectivity. 

Assumption 2. For all k and £, the weight matrix W/c(^) 
is symmetric W^j{£) = Wj^^{£) and doubly stochastic. There 
exists a scalar r] with < r] < 1 such that for all i^j el 

1) W}i{£) > rjfor all k >0 and £ >0. 

2) W^^l£) > rjfor all k >0 and £ >0 if {ij} G Mk,i- 



3) W^M) = Ofor all k>0 and£>0 if {ij} ^ M 



kl- 



The gossip exchange of each agent is local with its neigh- 
bors using this weight matrix W/c(^). For the i-th agent, the 
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local information is mixed with its neighbors as 



(20) 



for all i = I,-- - ,/. Then for the next exchange £ + 1, 
the network repeats the same process. By stacking the lo- 
cal information vectors into an ensemble vector = 
[0^ 1 (^) , • • • the exchange model can be written 

compactly as 

ct>M = [Wfc W In,] M - 1), 1 < ^ < 4, (21) 

where Ik is the maximum number of message exchanges 
during [r/e,r/e+i). 

The gossip exchange model specified in ( [20| ) under As- 
sumption [T] and [2] is a general model that includes time- 
varying network formations, where all agents form random 
communication links with their neighbors and advance their 
computations of the average of all local information vectors 
4>k = Y^i=i^k ii^)/^- With the prescribed communication 
model, we highlight the following two special cases which 
are often analyzed in consensus and gossiping literature pQ| , 

ini, ng, |T7|. 

1) Coordinated Static Exchange (CSE) fT3l, (Tf]: In 

the CSE protocol, each agent combines the information from 
possible multiple neighbors, determined by the communication 
network A, with a static weight matrix W for all updates and 
exchanges at Tk/ G [r/e, r/e+i) for ^ = 1, • • • , In particular, 
if the network is fully connected such that A = I — 11^, 
the communication interval is simply L = 1 in which each 
agent talks to everybody in every exchange. There are multiple 
ways to choose the weight matrix in the CSE protocol, where 
one of the most popular choice is constructed according to 
the Laplacian L = diag(Al/) — A as W = 1/ — wL with 
w = (3/ max(Al/) for some < (3 < 1. 

2) Uncoordinated Random Exchange (URE) |15|: For 
each exchange in the URE protocol during [r/c,rfc+i), a ran- 
dom agent i wakes up and chooses at random a neighbor agent 
j G to communicate. We define the matrix F = [yij]ixi 
whose (z, j)-th element jij represents the probability of node i 
choosing agent j once agent i wakes up. The gossip exchanges 
are pairwise and local |15|. Suppose agent I^^i wakes up at 
^k,£ G [T/c,r/c+i) and Jk/ is the node picked by node Ik/ 
with probability 7/^^,7^^. Then given some mixing parameter 
< /3 < 1, the weight matrix at this time is 

Wk{e) =1-/3 (e,,,, + ej, J (e,,,, + ej,_, f . (22) 

We acknowledge that typically, the URE protocol is com- 
pletely random and asynchronous, which does not necessarily 
satisfy Assumption [T] due to link failures and random link 
formation between the communicating agents. In the analysis, 
we assume that Assumptioin[T] holds, while in the simulations 
we show the robustness of the proposed algorithm to link 
failures when Assumption [T] may not hold. 

To facilitate further study on the network gossiping error 
under the general model, we invoke the following lemma and 
use it in later analysis. 



Lemma 1. [2U, Proposition 1] Let Assumption^and^hold. 



Then the entries of the product 



for every k 



converge with a geometric rate uniformly for all i^j G X as 



n 



i'=0 



< 2- 



1 — 77 



Lo 



(1 



V 



(23) 



with Lq > L being the least number of exchange for every 
agent to communicate exhaustively with all other agents in 
the composite graph, and the limit exists when i ^ 00 



^ 1 
lim Y[Wk{n = jll' 



(24) 



<?'=0 



Given Lemma [T] we have 



1 I 

lim 0fc,,(^) = -^0fc,,(O), (25) 
which accordingly results in an asymptotic local descent 



lim d,^(oo)=H-^h,. 



(26) 



Note that the error made in computing the local descent ([TT]) 
to emulate the exact descent in ([TT]) stems from two sources, 
including the gossiping error that persists with a finite Ik and 
the mismatch error by using the average surrogates h/^ and 
H/e instead of the exact global information. In the following 
section, we analyze the effect of this error in the convergence 
of the decentralized algorithm. 

IV. Convergence Analysis 

The GGN algorithm is summarized in Algorithm 1. In 
this section, we analyze the convergence of the local GGN 
algorithm by examining the recursion in ([18]). We make the 
following generic assumptions on the NLLS problem. 

Assumption 3. assume the following properties 

1) The vector function is bounded ||g(ic)|| < Cgfor a? G X. 

2) Denote by Amin(-) and Amax(') the minimum and max- 
imum eigenvalues and let 



CTmin = min ^JXmin {G^{x)G{x)), 



CTmax = max A/Amax {G^{x)G{x)). 

Assume that the Jacobian G{x) is full-column rank for 

all X with < <Jmin ^ CTmax < CO. 

3) The Jacobian G{x) satisfies the Lipschitz condition 
\\G{x) - G{x')\\ <uj\\x- x'W , x,x' e X, 
where uo is the Lipschitz constant. 



A. Perturbed Error Recursion Analysis 

At the (/c+l)-th update, the error between the local estimate 
x\^^ and a fixed point in (|4]) satisfies the following recursion. 

Lemma 2. Let X be a closed convex set and Assumption p] 
hold. The error \\x^~^^ — x'^W between the local iterate xf 
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Algorithm 1 Gossip-based Gauss-Newton (GGN) Algorithm 



given initial variables at all agents z G X. 

set A: = 0. 

repeat 

set A: = /c + 1. 

initialization: For i G X, each agent i evaluates 
hfc,,(0) = GjixDzM) 



respect to can be bounded as 

limsup \\x^i^^ - x"" 

k^oo 



< Pn 



(36) 



Proof: If the discrepancy error is upper bounded by a 
constant k, > such that 



(27) 
(28) 



(37) 



then from Lemma |2j the recursion can be upper bounded by 



\x, 



X 



-To xf 



X 



and constructs 4>k,i{^) ( [T4| ); 

network gossiping: Each agent i exchanges informa- 
tion with neighbors via gossiping 

4>k{tj = [Wfe(^) Iat^] (/)/.(^ - 1), 1 < < 4- Let Ci,/c = ll^i' - ^^'^IL then the error recursion dynamics can 



an. 
(38) 



7: local update: For i G X, each agent i updates 

d,^(4) = H-J(4)hfc,^(4) 



(29) 
(30) 



8: until \\x^^^ 



x7 



< e ox k = K. 



9: set the local estimate as x. 



xf. 



generated by the algorithm update ([18]) and an arbitrary fixed 
point x^ in ^ satisfies the following recursion 



\X. 



/c+1 



X 



X 



X 



(31) 
(32) 



where Ti = uj/^a^i^, T2 
aV2uje^/a'^-^ and = Wgix"") 



(1 - a)ama^/amm + 



Proof: See Appendix [A| ■ 
The error recursion is a perturbed version of the centralized 



c(4)-dr 



between 



recursion with the discrepancy error 
the distributed and centralized update. Next, we propose the 
main result on the convergence of this perturbed recursion. 

Theorem 1. f Sufficient Condition for Convergence with 
Bounded Perturbation) Let Assumption |j] hold. Let x e X 
be a closed convex set and 



(I-T2)- v/(l-T2)^-4aTi"^ 

(1-T2) + V(l-T2)^-4aTi"^ 
2Ti 



(33) 



with Ti , T2 > given in ( [3T] ) for some a and < < 
(1 - T2) V4aTi. // V2coe, < a^,^ and (1 - a^in/^^ax) < 
a < 1 such that < T2 < 1, and the discrepancy error can 



be bounded as 



d 



< n for all k > and i G X, 



then given any x^ that falls within the p^ax-neighborhood of 
a certain fixed point a?"^ G X 

W^i II ^ Pmax? (35) 

the asymptotic error of the local iterate x^ at each agent with 



be expressed as a dynamical system as 

< TiClk + ^20,fc + c./^, > 0. (39) 

Since Ci,/c is non-negative, this error dynamic can be upper 
bounded by the dynamical system of p^+i = il){pk) with 



i^{pk) = TipI + T2pk + ai^, pk > 0, 
whose equilibrium points are evaluated as 

p = Tip^ + T2P + ati. 



(40) 



(41) 



When K, satisfies k, < {1 — T2Y / ^oTi, the equilibrium solu- 
tions to ( |4T] ) exist and are obtained as in ( [33] ). According to 
1 51 1, an equilibrium point is a stable sink if |V^(-)| < 1 and 
unstable otherwise. Thus when < T2 < 1, the equilibrium 
point pmin is a sink because 



^(Pmin) 



l-^{l-T2f-AaT^n 



< 1, 



(42) 
(43) 



where ijj{p) = d?/^(p)/dp is the first order derivative of the 
dynamics, while Pmax is unstable since ip{pmax) > 1 for 
any T2. To guarantee < T2 < 1, it requires 



1 



< a < 1. 



<^min<^max 



(44) 



To guarantee that the term on the left is strictly less than 1, it 
is required that V2uje^/a'^^^ < 1. Therefore given ^/2uje^ <C 
^min' it is sufficient to have (1 — cTmin/c^max) < a < 1 such 
that all the requirements are met. 

Thus, if the initial error (^^ n ^ Pmax? the error is infinitely 
(34) accumulated and become unstable lim Q = 00. On the 



other hand, as long as the errors are constantly bounded by 

< Ci,k < Pmax 

for all z's and /c's, the algorithm reaches the equilibrium error 
floor pmin- Thus, as long as the initialization error Ci,o satisfies 
< Ci,o < Pmax for z = 1, •••,/, the algorithm progresses 
with contracting error until reaching the error floor pmin due to 
the constant bounded perturbation k,. Finally, it is shown that 
as long as the initial condition x^ satisfies ||ic^ — a3*|| < Pmax 
with respect to a certain fixed point x'^, the error norm is upper 
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bounded by the steady state error pmin 

limsup — ic^ll < pn 



An immediate result of Theorem [T] is that if the gossip 
exchanges ^/c's are moderately large, then k, ^ and the 
GGN iterate asymptotically approaches the centralized 
version x^. An intuition that can be drawn from the sufficient 
condition is that the smaller the Lipschitz constant u, the larger 
radius off the fixed point x'^ can be allowed for convergence. 
In other words, the less non-linear the cost function the 
better the convergence. Furthermore, if the problem is very 
consistent, meaning is very small at stationary points (which 
is often the case in data fitting problems in sensor networks), 
then pmax ^ 2crmin/^ " ^ and the steady state error is 
approximately pmin ^ which scales with the errors resulted 
from decentralized updates via gossiping. In particular, the 
convergence is quadratic if = and k, ^ 0, which means 
that the GGN algorithm achieves the same convergence rate 
as that in Newton's method without computing second order 
derivatives. Now that we have proved convergence of the GGN 
algorithm under bounded perturbations, we proceed to prove 
that the bounded perturbation assumption holds. 



B. Perturbation Analysis of n 

Preceding analysis has proven that if the perturbation error 
due to the distributed updates {£k) — is bounded, the 
condition proposed in Theorem [T] is sufficient to guarantee 
convergence of the GGN algorithm. In the following, we 
proceed to analyze this perturbation and further show that 
the bounded condition holds when the distributed update is 
achieved via network gossiping. 

1) Network gossiping error: Define at the ^-th exchange 
and their deviations from the exact averages hk and H/c as 



efcW = [er,i(€),. 
EfcW = [Ej,(£),. 



where ek,i{i) = hk,i{i) - hj, and Ek,i{i) = iik,i{^) - Hk 
The gossip errors ek{ik) and F/k{ik) are closely related to the 
properties of the weight matrices W/e(^) given in Lemma [T] 
as stated below. 

Lemma 3. Let Assumption |7] and |2] hold. The gossip error 
ek{ik) <^nd E/e(£/c) after the k-th update at the ik-th exchange 
can be bounded as 



\\ek{lk)\\<C\'^\ ||E, 
where C = 2P 



I p < , 



^ max V 



Proof: See Appendix Ib] ■ 



and = (1 



2) Perturbation: Define the errors between the surrogate 
h/c, H/c and the exact information r{x^) and R(^cf ) as 



(45) 



1 I 

A,,, = H, - R(a^f ) = - ^ [H,,,(£) - UkA^)] ' 

p=i 

Since the individual Jacobian Gi{x) is a sub-matrix of G{x), 
thus from the Lipschitz condition (3) in Assumption |3] and 
1 52, Corollary 3.L3], it also satisfies 

||G,(x)-G,(a;')ll<l|G(a.)-G(x')ll 

< ||a3 — ic'll , x^x^ gX. 

From conditions (1) and (2) in Assumption |3] and |[53j 
Theorem 12.4], Gf {x)gi{x) and Gf {x)Gi{x) also satisfy 
the Lipschitz condition with constants us and for arbitrary 
x^x' 

||Gf (a.)g,(a.) - Gj{x')g,{x')\\ < \\x - x'\\ 
||Gr(a;)Gi(a;) - GJ {x')Gi{x')\\ < \\x - x'\\ , 

where > uj{Cg + (Jmax) and > 2(Ji„ax'^- By definition 
( |45| i and the Lipschitz conditions, we have 



< 



^ 1 



(46) 



(47) 



Thus, we can express the local information for each agent 
during gossip in relation to the exact information 

hk,^{i) = r{x^)^Sk,^^ek,^{i), (48) 

l^k,^{i) = K{X^) + Ak,^ + Efc,,(£). (49) 

Clearly, this discrepancy depends on the gossip errors 
E/c,i(^/c) and ek^i{ik)^ and the inconsistency Ak^i and Sk,i 
due to the disagreement ||ic^ — ic^|| for each pair of i-th and 
j-th agents. Gossip errors have been specified in Lemma |3j 
thus in the following we bound the disagreement — Xj^. 

Assumption 4. Denote the minimum number of gossip ex- 



changes as £^ = min/c 
chosen to satisf^ 



c}. We assume that {ik}kLo 



D= \im y A(^-^*) < oo. 



and for any given < <C 1, the minimum exchange £^ 
satisfies 



L > log 



/log A^, 



simple choice is £o = l^, and £k = ^k-i + 1. then D = 1/(1 — Ar^). 
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where C is a constant defined in Lemma |j] and 

I 



with V = max{2, v^^ v^. 



^Cgl 



(50) 



Lemma 4. Given Lemma |7] under Assumption p] |2] |j] and |5] 

if the initializer at each agent satisfies = x for all i, the 
deviation lla?^ — ic^ll for any i and j satisfies 



\x- 



■ X- 



<CCA\;Du- 



where Dh 



A 



Proof: See Appendix [C] ■ 

Proposition 1. Given Lemma [7] |j] and Q under Assumption |7] 
|2]|j]a^j|4] the discrepancy between the inexact decentralized 
descent and the exact decentralized descent is bounded as 



<(4)-< <K 
for all i and k with an arbitrary < a^: <C 1. 



(51) 



Proof: See Appendix [P] ■ 
Remark 1: We acknowledge that the condition in Theorem [T] is 
proven using very pessimistic bounds, such as the eigenvalues 
bounds in Assumption [3] and therefore it is sufficient but not 
necessary. In fact, as shown in Section IVl| even when the 
sufficient conditions are not satisfied, the algorithm behaves 
well even with link failure present. 

The GGN algorithm and its variants can be found in various 
papers, where existing works have directly used this algorithm 
in sensor networks |32]-f34l for localization, however with- 
out the performance and convergence analysis. The analysis 
conducted in this paper can serve as a general guideline to 
predict the numerical stability in different situations. In the 
following, we demonstrate its application in power systems, 
which has gained considerable interests and popularity in the 
past few years to realize wide-area distributed control. 

V. Application : Power System State Estimation 

A power network is characterized by vertices (called buses) 
representing simple interconnections, generators or loads, de- 
noted by the set A/" = {1, • • • , N}. Transmission lines connect 
these buses are the edges of the power grid topology, denoted 
£ = (n, /) with cardinality \8\ = L that corresponds to 
the transmission line between bus n and n, / G A/". Each 
transmission line is characterized by the admittance matrix 
Y = [-Yni]NxN, where Y^i = Gni + i^n^ (^,0 ^ ^ 
is the line admittance, and each bus has a shunt admittance 
Yni = Gni + ^Bni associatcd with the E-modeQ of the trans- 
mission line (n, /) G £. Note that Y^n = - T^i^ni^ni^Yni) is 
defined as the self-admittance. The state of the power system 
corresponds to the voltage phasors at all buses, described 
by voltage phase and magnitude x = [0^,V^]^, where 
^ — [^1, • • • 7 ^A^]^ is the phase vector with Oi being the slack 
bus phase, and V = [Vi, • • • , Vn]^ contains the magnitude. 

^ The n-model is a circuit equivalent of a transmission line by abstracting 
two electric buses as a two-port network in the shape of a 11 connection (54|. 



A. Power Measurement Models 

Nowadays, power measurements include phasor mea- 
surements {VmOn) and the active/reactive power injection 
{Pfi^Qn) for buses n G A/", and the active/reactive current 
{Ifih Jni) and the active/reactive power fiows (PnhQni) on 
transmission lines (n, /) G £. The ensemble of these quantities 
can be stacked into the length- 2 A/^ phasor measurement vector 
f p^{x) = X and power injection vector fj^{x), as well as the 
length-4L current vector fx{x) and line flow vector fs{x) 
respectively 

/x(^) = [••• Jnl{x),--- ,Jnl{x),---]^ (52) 

f^{x) = [Pi{x)r-' ,PN{x),Qi{x)r-- .QN{x)f (53) 
f^{x) = [-- - ,Pni{x)r- - r-' .Qni{x)r-'f (54) 
and expressed in relation to the state x as in p4| 

N 

Pn{x) = Vn^Vi {Gni COS 6nl ^ B^l s'lli Onl) 
N 

Qn{x) = Vn^Vi {Gni s'lli Onl - Bnl COS Onl) 
ly^n 

Pnl{x) = V^Gnl - VnVi {Gni COS Onl + Bnl s'mOnl) 
Qnl{x) = -V^Bnl - VnVi {Gni sin 6>nl - Bnl COS0nl) , 

where 6ni — On — ^i- The current expressions are omitted 
because it is directly the Ohm's law. Note that these quantities 
can also be expressed in terms of the real and imaginary parts 
of the voltage, but here we stick to the conventional polar 
representation of the magnitude Vn and phase dn- 

The above power quantities are recorded by field devices 
in the power grid, and gathered centrally by aggregation to 
control centers. By stacking all these quantities together as a 
vector function of the state x and all the measurements into 
a vector z, the ensemble of all power measurements is 



z /(a?) +s, 
where x represents the true state and 















A 

z = 


ZAT 






, s4 


ex 
ej\f 
eg 



(55) 



(56) 



B. Formulation and Solution 

In practice, a reasonable abstraction of the data acquisition 
architecture in power systems is as an interconnected multi-site 
infrastructure, with / sites in which the z-th site covers a subset 
of buses n e Mi satisfying J\fj = and Mi.Mj C M. 
The z-th site temporally aligns and aggregates a snapshot of 
Mi local measurements of {zi^ni}m=i within the site or on the 
lines that connect its own site with others, as shown in Fig. 
|3] Also, the observations gathered are not exhaustive. The i-th 
site's measurements are selected from the ensemble in ( [55] ) as 



TiZ = f^{x)^e^ 



(57) 
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Fig. 3. Multi-site structure in IEEE-30 test case 



Where z, 4 [z^;,, z^^, z^^, z^,]^, f,{x) ^ TJ{x), 

and = diag[T^,A',T^,x,T^,Ar,T^,£:] is a block diag- 
onal binary matrix selecting the corresponding measure- 
ments at the z-th site. Specifically, Ti x e {0, l}^^ -^ 

{0, l}^^''^x4i/ selection matrices with each row having 
only one "1" entry located at the index of the corresponding 
element in f{x) measured by field devices. Note that the 
number of measurements recorded by each agent is Mi — 
M,^x + M,,x + M,^M + Mi^e = Tr(Tf T,). 

The universally accepted problem formulation for static 
state estimation is to solve a weighted NLLS problem that 
fits the estimated state to the power measurements p6| , p7| . 
In the following we set the weights to be equal for simplicity. 
Assuming E{ee^} = a^I, the state is estimated as 

/ 

S = min V||z,-/,(a^)||' (58) 

a3GX ^ — ' 
i=l 

where X = {On e [-6>max, 6>max], K e (0, Fmax], n e Af} 
with 6>max and Knax being the phase angle and voltage limit. 
Clearly, by letting 

g,(x) ^ z, - Gi{x)^-^^ (59) 

the problem can be solved using the proposed GGN algorithm. 
Remark 2: In many practical scenarios p4| , |[27|-|[29|, many 
similar NLLS problems in a network take the form 

/ 

x = min llz^m] - /^(^[m])||^ , (60) 

where Zi[m] G is a snapshot of measurements taken at 
agent i at time m and x[m] is the true state at that time. In 
this scenario where the measurements stream in sequentially 
over time, instead of the static scenario elaborated previously, 
the GGN algorithm can be readily applied to solve such 
dynamic problems in real-time by initializing x^ [m] with the 
previous local estimate Xi [m — 1] . In the following, we show 



numerically the applicability of the proposed GGN algorithm 
on estimating and tracking the state of power systems using 
real-time power measurements from substations. 

VI. Numerical Results 

In this section, we compare the GGN algorithm in terms 
of the objective value ( [58] ) and the Mean Square Error (MSB) 
under the CSE protocol with existing network diffusion al- 
gorithms |48| and in particular, we show numerically the 
applicability of the GGN algorithm to adaptive processing as 
described in ( [60| ) against the method proposed in f24]. The 
MSE with respect to V^s and O^s at the i-th site is 

MSE^;) = ||Vi - Vf , MSEg) = ||0i - 0f . 

We further illustrate the MSE performance of GGN algorithm 
using the URE protocol in the presence of random link 
failures. The simulation is based on the IEEE-30 bus (N = 30) 
system in MATPOWER 4.0. The initialization is simply set to 
1 for voltage magnitude and for phase. 

We take one snapshot of the load profile from the UK 
National Grid load curve from |55 | and scale the base load 
from MATPOWER on the load buses. Then we run the Opti- 
mal Power Flow (OPE) program to determine the generation 
dispatch for that snapshot. This gives us the true state x and 
all the quantities f{x), which are all expressed in per unit 
(p.u.) values. For simplicity we randomly choose 50% of all 
available measurements f{x) and generate the measurements 
by adding errors Si^rn ^ A/'(0,cr^) with = 10~^. 

A. Comparison with Diffusion Algorithms under CSE Protocol 

In this subsection, we evaluate the overall performance of 
the GGN against existing network diffusion algorithms |48| 
and its extension to adaptive processing in |24|. To make a 
fair comparison in terms of communication costs and accuracy, 
we exploit the coordinated model (i.e., CSE protocol) for [24|, 
|(48l and our method, where the agents exchange information 
synchronously. For simplicity, we divide the entire system 
into / = 3 sites as in Fig. [3] Communication links exist 
between every two agents {i^j} G M for Vz, j G X, giving 
an adjacency matrix A = l/lj— I. The weight matrix used 
in both cases is constructed according to the Laplacian L = 
diag(Al/) — A as W = 1/ — wL with w = (3/ max(Al/) 
and f3 = 0.3. The step- size for the GGN algorithm is set 
as acGN = 0.5 while adifr = 10~^, 10~^, 0.05 for the sub- 
gradient methods via network diffusion. 

The network diffusion algorithm proceeds as each exchange 
£ takes place, while the GGN algorithm runs = = = 3 
exchanges for each k-th update. In particular, the comparison 
is on the value of the global cost function ( [58] ) evaluated using 
the decentralized estimates 
/ 

Val = ^||zi-/,(£i)f (61) 

i=l 

and the global phase MSE 

MSEy = MSE(;^ , MSEe = ^ MSE^^ (62) 

i=l i=l 



10 




>-GGN Algorithm (CSE) (p = 0.03, a=1) 

Diffusion+lnnovation (Static) [48] (p = 0.03, a=0.001) 
- Diff usion+innovation (Static) [48] (p = 0.03, a=0.005) 

Diffusion+innovation (Static) [48] (p = 0.03, a=0.01) 
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(a) Objective value (b) MSEy Comparison (c) MSEe Comparison 

Fig. 4. Comparison between GGN Algorithm (CSE Protocol) and diffusion algorithm in [53] against the gossip exchange with i^, = 3 exchanges for every 
update. 




>-GGN Algorithm (CSE) (p = 0.03, a=1) 

Diffusion+lnnovation (Adaptive) [24] (p = 0.03, a=0.001) 
- Diffusion+lnnovation (Adaptive) [24] (p = 0.03, a=0.005) 
Diffusion+lnnovation (Adaptive) [24] (p = 0.03, a=0.01) 
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(b) MSEy Comparison 



Gossip Exchange 

(c) MSEe Comparison 



Fig. 5. Comparison between GGN Algorithm (CSE Protocol) and adaptive diffusion algorithm in [24] against the gossip exchange with ii, = 3 exchanges 
for every update. 



which is plotted against the total number of gossip exchanges 
so that the comparison is performed on the same time scale. 
The marker "o" for the GGN algorithm is placed on the scale 
of each algorithm update k. Since £k = 3 for all /c, the update- 
exchange ratio is 1 : 3 such that the marker "o" occurs once 
every 3 points on the curves. 

1) Estimation on Static Measurements: In this subsection, 
we show the comparison between our approach and that in 
| [48| over 9000 gossip exchanges overall. Clearly, the GGN 
algorithm converges much faster since it reaches the steady 
state error after k = 10 updates (i.e., k£^ = 30 exchanges). 
Since the convergence results in Theorem [T] on the bounded 
perturbation are sufficient but not necessary, it is observed in 
Fig.|4(a)|to|4(c)|that the objective values Val and the MSEe of 



the state estimates still decrease rapidly over time even though 
the gossip exchange per update = 3 is small. On the other 
hand, the objective value and the MSE of the network diffusion 
algorithm for AC state estimation in | [48| decreases slowly 
until 10000 exchanges with large oscillations (due to constant 
step- sizes). Network diffusion algorithms using a diminishing 
step-size Ofdiff/^ for ^ = 1, 2, • • • exhibit similar performance 
except for the oscillations and hence are not presented for 
clarity. Also, it can be seen that for a non-convex problem, 
the steady state of the network diffusion algorithm is sensitive 



to the choice of the step- size a^m and there exist significant 
oscillations that persist in the update. On the other hand, the 
GGN algorithm conditions the gradient by the GN Hessian 
and therefore the update tends to be smooth and continues to 
lie in the proximity of the desired solution with high accuracy. 

2) Estimation via Adaptive Processing: Here we show 
numerically the applicability of the GGN algorithm to adaptive 
processing as described in ( |6Q| ) against the method proposed in 
|24| with the same step-size and network setting. In this sim- 
ulation, we generate 4 snapshots of measurements {'Zi[m]}l^i 
for m = 1 , • • • ,4 based on the same state x[m] = x by adding 

i. i.d. Gaussian noise with variance = 10~^, similar to the 
adaptive setting considered in |24|. More specifically, we use 

ii, = 3 gossip exchanges between every two algorithm updates 
until k = 10, thus leading to a total number of 30 exchanges 
per snapshot. It can be seen from Fig. |5(a)| to |5(c)| that the 



proposed GGN algorithm tracks the state estimate accurately 
when new measurements stream in, where the spikes observed 
in the plots are caused by the new measurements. Since 
the number of gossip exchanges is limited, the first order 
diffusion algorithm in |48| and f24l, which suffers from slow 
convergence, exhibit similar characteristics to those in Fig. 



4(a) to 4(c) and fail to track the state as accurately as desired. 
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Fig. 6. MSB performance of GGN (URE Protocol) in IEEE-30 bus system with / = = 30 agents and 0{N) pair- wise gossip exchanges, (top) : perfect 
communication (bottom) : p = 0.3 random link failures (each line corresponds to one agent). 



B. MSE Performance under URE Protocol with Link Failures we track both the individual cost function 

Val« = ||z,-/,(2,)|| 



In this section, we examine the MSE performance of the 
GGN algorithm under the URE protocol with a fixed number 
of algorithm updates K = 4Q. The performance is evaluated 
with a demanding setting, where we divide the A^-bus system 
into N sites and each site only communicates with one of its 
neighbor 10 times on average. The network- wide communi- 
cation volume in this scenario is on the order of the network 
diameter 0{N), which implies the number of transmissions 
in the centralized scheme as if the local measurements are 
relayed and routed through the entire network. For simplicity, 
we simulate that at each exchange, the i-th distributed agents 
wakes up with uniform probability 1 // and picks a neighbor 
with equal probability 

In order to show the robustness of the proposed algorithm, 
we examine the performance of the GGN algorithm for cases 
with random link failures, where any established link {i, j} G 
M fails with probability p = 0.3 independently. It is clear that 
this random communication model with link failures may not 
satisfy Assumption [T] but it is shown below that our approach 
is robust to the random setting and degrades gracefully with 
the probability of link failures. As performance benchmarks. 



evaluated by local estimates Xi and the individual voltage 
and phase MSE of the estimation MSE^^^ and MSE^^ in 
Fig. [6] It can be observed from the figure that the MSE 
curves of state estimates of different sites are highly consistent 
and they all converge asymptotically when there is no link 
failures. Similar behaviors can be observed for the case with 
random link failures, where the local estimate at each site 
is not in perfect consistence with the others, but the accuracy 
remains satisfactory compared to the perfect case and degrades 
gracefully with the probability of link failures. 

VII. Conclusions 

In this paper, we study the convergence and performance of 
a Gossip-based Gauss-Newton (GGN) algorithm, which can 
serve as a guideline in solving similar problems. We also 
demonstrate its application in power system state estimation, 
where numerical results suggest that that the proposed algo- 
rithm leads to accurate state estimates across the distributed 
areas, which greatly improves the robustness against link/node 
failures or network congestions with communication and com- 
putations that scale similarly to the centralized scheme. 
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Appendix A 
Proof of LemmaO 



To study the convergence of the GGN algorithm, we exam- 
ine the distributed update 



X 



(63) 



which can be written with respect to the descent in ([TT]) 



(64) 



By subtracting the fixed point x* and using the non-expansive 
property of the operator Px( ) on the closed convex set X, we 
have the following recursion 



< 



ad 



a 



4{ik) - d'l 



For convenience, we denote G^(-) as the pseudo-inverse 
of G(-). For any fixed point cc"^ G X in (|4]) such that 
G\x'^)^{x'^) = 0, the first term can be equivalently written 
by substituting ([5]) as follows 

x^ - ic* - ad^ = x^ - x"" (65) 
- aG\x^)g{x^) + aG^{x^)g{x^). 

Using ([3j together with the invertibility condition of G{x) 
over G X in Assumption [3] we have 

x^ -x"" = G^{x^)G{x^) {x^ - x"") . (66) 



Then by substituting ( |66| ) into ( [65] ), and meanwhile adding and 
subtracting simultaneously a term aG^ {x^)g{x*), we have 
the following expression 

x^ -x"" - ad\ 

= G\x^,) [G{x^,) [x^, - x^) - ag{x^) + ag{x^)] 
-^a[G\x^)-G^{x^)] g{x^). 

The expression in the first term can be re-written with the 
mean- value theorem as follows 



(67) 

G{x){x'' -x) 



ag{x^) - ag{x) - G{x){x^ - x) 

= a [ G{x^t{x'' -x)){x'' -x)dt 
Jo 

= ^(^j^ [G(cc + t(cc*-x))-G(x)](x*-ic)dt 
-(l-a)G(x)(x*-cc), 
whose norm can be bounded by using Assumption [3] as 

\\ag{x^) - ag{x) - G{x){x'^ - x)\\ (68) 



< a 



[ WGix^tix"" -x)) -G{x)\\dt Wx-x-'W 
Jo J 

+ (1 - a)amax \\x - x'^W . 
From the Lipschitz condition in Assumption [3] we have 

/ WGix^tix-" -x))-G{x)\\dt<uj\\x-x''\\ [ tdt. 
Jo Jo 



Thus, if condition (3) of Assumption [3] holds, we have 
\\ag{x*) - ag{x) - G{x){x* - x)\\ 
< I ||a; - x*f + (1 - a)a„,ax \\x - x*\\ 



and finally according to j36J Lemma 1], we have 

||Gt(x)-Gt(a.*)|| 

<V2\\G\x)\\ ||Gt(a^*)|| ||G(a^) - G(a^^ 



< 



V2u 



\x — X 



By the definition of matrix norm, we have ||G^(ic) 



. Then evidently. 



GHx){GHx)f = {G^{x)Gix))-' 
the norm of a matrix inverse corresponds to the reciprocal 
of the eigenvalue of that matrix with the smallest magnitude. 
By Assumption 3 condition (2), this quantity is bounded as 
cr^-^, and therefore ||G^(cc)|| < l/cTmin. Thus finally, using 
||G^(ic)|| < l/cTmin and the above inequalities gives the 
following bounds 

x^ -x"" - ad\ < Ti \\x^ - x^\f + T2 \\x^ -x^W, 



where = ||g(cc^)|| indicates the goodness of fit at x'^ and 



Ti = , T2 = (l-a) ha — ^ . 



(69) 



Therefore, we have the error recursion ( [3T] ). 



Appendix B 
Proof of Lemma[3] 



Using ( [2T] ), we evaluate the deviation of (f^kW ^^om the av- 
erage = [1^ ^ Iat^] 4^k{^)/I for a finite £. By subtracting 
the average on both sides of ( [2T] ), we have 



i'=0 



0fe(O) 



/ 
I 



\l'=0 



MO)- 



Then, we bound the norms of the above equation as 



11^ 



n Wfc(^') - ^ 



\\4>M-4>k\\ < 

Using Lemma [1] and the norm inequality 



< 



n 



< 



e'=o 



\\4>km\- (70) 

•||^, we have 

Mm 

^ WM 
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The quantity ||(/);.(0)|| is by definition ( p4| ) determined as 

\\Mo)f = ^ \\hk,mf + E iihm(o)ii^ (71) 



Note that from the norm equaUty of sub-matrices 



i=l 
I 



||Ho-^Eo,.(4)|| < ||Ho-^ 
1^0 ^^oji^o 



< llHo'l 



|Eo,i(4) 
|Eo,i(4) 



< l|Ho 

< llHn 



\Wo)\\f 
l|Eo(4)||^, 



(72) 



< 



where the norm inequaUty is used ||Gf (a;)Gi(a;) 
||G^(x)G(x)||' < N\\G^ix)Gix)\\l = iVcxL,. Letting 



C = 2P^Ial,^{CI + Nal,J{l + v-^oy^l_^Lo) ^nd 
= (1 — rj^o^i/Lo^ ^j^gj^ ^j^g error is bounded as 

\\cf>M-4>k\\<CX'^. (73) 
By definition of the errors ek^i{i) and E/e,i(£), we have 



vec [Efe,i(< 



vec [Efc,K^)], 



(74) 



and therefore the norm of each component in the error vector 
— 4>k is bounded by the full norm 



|e,(^)||<CA^,, ||E,(^)||^<CA^^. 



(75) 



Appendix C 
Proof of Lemma[4] 

We prove this result by mathematical induction. During the 
proof, we will repetitively use the result in | [52| that for any 
given matrix Z and SZ, the matrix series expansion 



(z + sz)-^ = ^(-1)^+^ {z-^szy z-1 

q=0 

holds as long as ||Z~^(5Z|| < 1. 



(76) 



A. Initial Case: k = 1 

When k = 1 and given = for all i, we have 

\\xl-x]\\ < ||d°(£o)-<(£o)||, 
where the discrepancy is expressed explicitly as 
d^(4) - d'^iio) = [Ho + Eo,.(4)]"' [ho + eo,.(4)] (78) 
- [Ho + Eo,,(4)]"' [ho + eo,,(4)] • 

Thus, if the perturbations Eo,i(^o). Eo,j(^o) are small enough, 
the expansion in ([76]) can be applied here to simplify the 
expression. 

1 ) Matrix series expansion: Since a?^ = a?^ for alH such 
that Ho = ^{x^) and ho = r(ic-^), they can be bounded as 

||ho|| = ||r(.^?)||<^^, (79) 



Hn 



< 



(80) 



and by AssumptionH] we have io > i^. From Lemma [3] and 
again Assumption [4] the above upper bound can be further 
bounded as 

J J Ck 

l|Ho-i||Eo(£o)||^<— CAl°< 



mm 
2 

min 



a 1 + DC, 

Therefore, letting 5Z = Eo,i(4) or Eo,^(4) and Z = Hq, 
then it follows that the expansion holds. By grouping all the 
high order terms q >2 and bound them with (9 (/^^), we have 

4{£o) - d%l^) = Ho-i [Eo,.(4) - Eo,,(4)] Ho-^ho (81) 

-HoMeo,z(4)-eo,^(4)]+(!^(^'). 

2) Proof of success when k = 1: Similarly by Lemma |3j 
we can bound 

||Eo,.(4) - Eo,,(4)|| < 2CXl' = 0{k), 
||eo,^(4)-eo,,(4)|| <2CA^^° =(9(/.). 

From ( [5Q| ) and ( [ST] ), we have 
||d°(£o)-d°(^o)|| 

<CA^2(||H„-i|| + ||H„-i|n|ho||)+0(«2) 



<^max^^ 



C'l 



4 ' 2 

^min ^min 



< CC,Xi;Do + O (k^) , where I^o = A^^°-^*). 
Ignoring high order terms O (^^), we have 

and therefore the result holds for /c = 1. 

B. Induction: k = K and k = K -\- 1 

Let the error bound holds for k = K such that 



(82) 



"J I 



xrw <ccaI:Dk-i. 



(83) 



Similarly, the triangle inequality holds 



\X 



where 



< Wxf 



X 



K\ 



df{lK)-df{lK] 



df{iK)-df{eK) 

= [H^ + EkA^k)] [hK + eKM] 

- [H^ + ^KjieK)] [hK + BkA^ 



(84) 



Similar to the case when k = 1, if the perturbations EK,i{iK), 
Exj(^k) are small enough, the expansion in ( |76l ) can be 
applied here to simplify the expression. 

]) Matrix series expansion: By definition l(45]i, we have 



H 



K{xf) + Ak, 



K 



(85) 
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which is another perturbed inverse. Thus we first examine 
whether this inverse can be expanded using the series expan- 
sion in ([76]). From ( |46| ) and ( [83] ), we have 

where the last inequahty comes from the non-negativity of 
(i.e., D > Dk for all finite k). By Assumption [4] we have 

\\R-\xf)AK,i\\ < ||R-'(a^f ) 



and therefore the result holds fork = K-\-l given k = K. 

Appendix D 
Proof of Proposition[T] 



By the decomposition in ( [48] ), we have 



(89) 



(87) 



< 



< 



Da 



-n < n. 



Therefore, the matrix series expansion holds for ( [85] ). Then by 
grouping all the high order terms and bound them with 0{n^), 
using the above calculations we have 



fl-^ll < ||R-^(icf)|| + ||R"^(ic: 



K\ 



2 2 
^min ^min 



\AkA\+0{k^) 
(88) 



With ([88]l, we have 

\\R-k^^kA^k)\\ < ||H^i W^kA^kMI < ||H^i ||Eif(fif)| 

and there is £k ^ Then by Lemma[3]and again Assumption 
[4] the above upper bound can be further bounded as 

/ / 



= [R(ccf ) + Ak^i + Efe,i(4)] ^ [r(xf ) + Sk,i + efc,i(4)] 

Now that we verify that the matrix series expansion holds 
for similar approximations. First of all, from Lemma [4] and 
( [46] ), we have ||A/e^^(4)|| < y/^CCa^ij' Dk-i. The expansion 
depends on the quantity 

||R-i(xf)(A,,, + E,,,(4))|| 
< ||R-n^.')|| ||A,,|| + ||R-Ha.,^)|| ||E^,(4)|| . 
Using the derivation in ( [87] ), we have 

||R-i(;r,^)(Afe,,+Efe,,(4))|| 

1 



< n 



(1 + DC,) 



H 



K 



l|Eif(fK)||^< 



mm rr 

< k^O{k^) <C 1, 



I and the expansion holds. The difference can then be written 
by grouping the high order terms 0(j^) as 

d,^(4) - d\ = -Il-\x^) [A,,, + E,,,(4)] II-\x^)t{x^) 
-^Il-\x^)[6k,^^ek4ik)]^0{e^). 

Note that ||^/c,i(4)|| < lysCC^X^;^ D^-i, we have 



thus the matrix series expansion also holds for ( [84] ). Then 
by grouping all the higher order terms and bound them with 
O (/^^), we have 



(4) - d^ < {vACC,\l;Dk-i + CXi^ 

^{vsCC,Xl;Dk-i^CX'-] 



mm 



min 



2) Proof of success when k 

we can bound 



K ^ 1: From Lemma [3] 



^ ^ ^maxCgl 



I 



and accordingly 



Using ( [50] ) to bound the above inequality, we have 

4{ik)-d^\\<CCAl^Dk_,C,)Xl;. (90) 

Considering that D > Dk-i for any k, then using Assumption 
[4] gives us the following bound 



df{lK)-df{lK) 



K I 



d\{h)-d\ 



< K 



(91) 



< ca;^ 2 



H 



K 



H 



for all i and k. 



<CC,X^^^ ^0{k^). 
Finally by ignoring high order terms O (^^), we have 

||xf +1 - < ||xf - II + caA^- 
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